1—5 

in 






An MCMC Approach 

to Universal Lossy Compression 

of Analog Sources 

Dror Baron 

Department of Electrical and Computer Engineering 

North Carolina State University; Raleigh, NC 

Email: barondror@ncsu.edu 



^ Tsachy Weissman 



Department of Electrical Engineering 

Stanford University; Stanford, CA 

Email: tsachy@stanford.edu 



Abstract 



> 

^v^ Motivated by the Markov chain Monte Carlo (MCMC) approach to the compression of discrete 

j^ sources developed by Jalali and Weissman, we propose a lossy compression algorithm for analog sources 

^^ that relies on a finite reproduction alphabet, which grows with the input length. The algorithm achieves, 

(^~>) in an appropriate asymptotic sense, the optimum Shannon theoretic tradeoff between rate and distortion, 

universally for stationary ergodic continuous amplitude sources. We further propose an MCMC -based 

^ algorithm that resorts to a reduced reproduction alphabet when such reduction does not prevent achieving 

^> the Shannon limit. The latter algorithm is advantageous due to its reduced complexity and improved rates 

%^ 

^ of convergence when employed on sources with a finite and small optimum reproduction alphabet. 



I. Introduction 

Lossy compression of analog sources is a pillar of modern communication systems. Despite numerous 
applications such as image compression [2,3], video compression [4], and speech coding [5-7], there is 
a significant gap between theory and practice. 

Much of the research was performed when the first author was with the Electrical Engineering Department at the Technion, 
Israel. Subsets of the work appeared in [I]. 



A. Entropy coding 

Many practical lossy compression algorithms employ entropy coding, where scalar quantization is 
followed by lossless compression (ECSQ). ECSQ has motivated much work into optimization of scalar 
quantizers [8-10], whereas the translation to bits can use Huffman [11] or arithmetic [12,13] codes. 
Despite the simplicity and elegance of ECSQ, even for independent and identically distributed (iid) 
sources the rate distortion (RD) function [13, 14], which characterizes the fundamental limit for lossy 
compression, suggests that ECSQ-based coding can be highly suboptimal (cf . Figure [T] for an example). 
For non-iid sources, ECSQ may compare even less favorably with the fundamental RD limit. 

In order to bridge part of the gap between ECSQ and the RD function, vector quantization (VQ) 
converts an entire vector to a codeword [7, 15, 16], in contrast to scalar quantization, which compresses 
individual scalar input elements. VQ provides a better trade-off between rate and distortion as the vector 
dimension increases, but increased complexity is required [17]. The significant computation required by 
VQ necessitates developing computationally feasible alternatives. 

B. Related work 

For finite alphabet sources, recent advances have demonstrated that the RD limit can be approached 
asymptotically [18-20] by partitioning an input into sub-blocks, where a Shannon-style codebook [13, 
14] is applied to each sub-block. Some of these schemes can compress universally without knowing the 
source statistics beforehand, but it is challenging to generate a codebook distribution whose statistics 
differ from those of the input statistics [21]. 

Lossy compression over a finite alphabet can also be performed by directly mapping the entire input 
to an output sequence while accounting for the trade-off between the compressibility of the output and 
the distortion between the input and output sequences. This optimization can be deterministic [22] or 
stochastic [23] in nature. Directly mapping to the output sequence effectively quantizes the entire input 
- a long sequence - into a large output codebook, and achieves the RD Umit for stationary ergodic finite 
alphabet sources universally. Another promising recent approach to (non-universal) lossy compression 
relies on algebraic codes [24]. 

For analog sources, less progress has been made in developing theoretically-justified compression 
algorithms. Some results have been derived specifically for the high-rate regime, where the Shannon 
lower bound is asymptotically tight [25]. In particular, in the limit of low distortions the RD limit has 
been characterized for mixtures of probability distribution functions (pdf 's) where one distribution is 
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Fig. 1. Laplace source: Comparison of entropy coding (ECSQ), results by Yang and Zhang [33], average rate and 
distortion of Algorithm 2 (MCMC) over 10 simulations, and the RD function, (n = 1.5 • 10**, \Z\ = 9, r = 50, 

kKi ilog|2|(n).) 



discrete and the other continuous [26,27]. For example, the sparse Gaussian source is a mixture pdf; 
bounds on its RD function have been provided [28-31]. 

Despite the theoretical insights in the high-rate regime, compression of analog sources at low-to-medium 
rates is of interest in many applications [2-5]. There do exist special input pdf 's for which entropy coding 
approaches the RD function [32] in the low-rate limit, but the low-rate regime is challenging in general. 
We aspire to develop results of general applicability and not be limited to specific pdf 's with fortuitous 
properties. 



C. Contributions 

The crux of our approach to the compression of analog sources is to quantize to discrete reproduction 
levels, and then apply a compression algorithm similar to that of Jalali and Weissman [23], which uses 
the stochastic optimization approach of Markov chain Monte Carlo (MCMC) simulated annealing, as 
championed in [34]. A careful choice of the set of reproduction levels, growing appropriately with input 
length both in size and in resolution, achieves the RD function despite the analog nature of the source. 
A somewhat similar approach was suggested by Yang et al. [22,33] using deterministic optimization 



techniques. Note, however, that Yang and Zhang [33] require availabiUty of a training sequence, and so 
their algorithm is not universal. Although it is possible to apply deterministic optimization in a universal 
setting by partitioning the input into blocks, this approach results in a performance loss of 0.2-0.3 dB [33]. 

Our first contribution is a lossy compression algorithm for analog sources that relies on a data- 
independent reproduction alphabet that grows with the input length. This algorithm asymptotically 
achieves the RD function universally for stationary ergodic continuous amplitude sources. However, the 
reproduction alphabet grows with the input length, slowing down the convergence to the RD function, 
and is thus an impediment in practice. 

To address this issue, we next propose an MCMC -based algorithm that uses an adaptive reproduction 
alphabet. The ground-breaking work by Rose on the discrete nature of the reproduction alphabet for 
iid sources when the Shannon lower bound is not tight [35] suggests that, for most sources of practical 
interest, restriction of the reconstruction to a rather small alphabet does not stand in the way of attaining 
the fundamental compression limits. Indeed, at low rates even a binary reproduction alphabet is often 
optimal [32]. When employed on such sources, our latter algorithm zeroes in on the finite reproduction 
alphabet, and thus enjoys rates of convergence commensurate with the finite-alphabet setting. 

In order to render this adaptive algorithm computationally feasible, we develop a method to update the 
optimal reproduction levels rapidly. Utilizing this computational feature, our adaptive algorithm provides 
faster computation, achieves the RD function universally, and in some cases the smaller reproduction 
alphabet accelerates convergence to the RD function. Consequently, the adaptive algorithm is more suitable 
in practice. We emphasize that our algorithms are both universal, requiring no knowledge of the source 
statistics. 

The remainder of the paper is organized as follows. We provide background information in Section [ill 
Our first, brute force algorithm is described in Section |lllj followed by the adaptive reproduction alphabet 
algorithm in Section IV Numerical results are reported in Section |V] We complete the paper with a 



discussion in Section |Vl| Proofs appear in appendices, in order to make the main portion of the manuscript 
easily accessible. 

II. Background 

A. Notation 

Consider a stationary ergodic source X = {Xi,i > 1} with real-valued components. We process a 
length-n input x" = xiX2 • • • x„, which is an individual realization of the random vector X"-. The input 
x"- is compressed using an encoder e : X"- — )■ {0, 1}+ that maps x" to a finite output string e(x"). 



The decoder d : {0, 1}^ — )■ y^ maps the bit string back to a length-n output y" over the reproduction 
alphabet y, which may be a continuous or discrete subset of the real line. The output y" is the lossy 
approximation of x". 

We assess the performance of an encoder-decoder pair relative to the trade-off between rate and 
distortion [13, 14]. The rate of such a pair is defined as i? = E'[-|e(X")|], the expected number of bits 
per description of a source symbol, where | • | denotes length, size, or cardinality, and E[-] is expectation. 
The distortion D = E[dn{X'^,y"')] quantifies the expected per-symbol distortion, 

d„(x",y-)^-Vd(x„y,), (1) 

n ^-^ 

i=l 

where d : M x M — ;■ M+ measures the distortion. For concreteness in what follows, we assume the 
distortion is the square of the error d{xi,yi) = {xi — yif', but our approach readily carries over to 
accommodate general distortion measures. 

B. Lossy compression using MCMC 

We describe a variant of the scheme in [23] that compresses an input a;" to an output y" over a finite 
alphabet 3^ C R. This algorithm will later be employed as the main building block for compressing an 
analog source. The encoder approximates x" by y", which is compressed using the context tree weighting 
(CTW) universal lossless compression algorithm{^ The approximation y" is chosen to provide a good 
trade-off between the coding length required for y" and the distortion with respect to x". The decoding 
procedure is straightforward; the output bits are passed through the CTW decompressor to retrieve y". 

Denote the empirical symbol counts by mfc(y", u^)[a], i.e., 

mk{y'',u^)[a] ^\{k<i<n: y\_j,=v!'a]\, 

where k is the context depth, a G 3^, u'^ G 3"^, and u^a denotes concatenation of u^ and a. Define the 
fe-depth conditional empirical entropy as 

where log(-) is the base-two logarithm, and we use the convention wherein Olog(O) = 0. For k = 
o(log(n)), the difference between the CTW coding length and the empirical conditional entropy is 

' We prefer CTW [36], because for context tree sources it has lower redundancy than Lempel-Ziv based schemes [37] or 
adaptive arithmetic coding based on full-tree Markov models [33]. 



o(l) [36]. We define the energy e(y") corresponding to y" by 

e(y")^n[i/fc(y")-/3d„(x",y")], (3) 

where /3 < is the slope of the RD function at the point we want to attain. The Boltzmann probabiUty 
mass function (pmf) is 

/,(y")^^exp{-.£(y")}, (4) 

where s > is inversely related to temperature in simulated annealing [34], and Zg is the normalization 
constant. 

Ideally, our goal is to compute the globally minimum energy solution x", 

x" = arg mill e(ii;") = arg min [Hk{w'^) — fidnix"^ .w"^)]. (5) 

Computation of x" involves an exhaustive search over exponentially many sequences and is thus infeasi- 
ble. We use a stochastic Markov chain Monte Carlo (MCMC) relaxation [34] to approximate the globally 
minimum solution, in contrast to the deterministic approach of Yang et al. [22]. We denote the resulting 
approximation by y". 

To sample from the Boltzmann pmf Q, we examine all n locations. For each location, we use a Gibbs 
sampler to resample from the distribution of yj conditioned on y"\* = {y„ : n / i} as induced by the 
joint pmf in Q, readily computed to be 

/,(y, = a|y ) = ^^g^p|_^ [nA/7fc(y-i6y7:,,, a) - /3A(i(6,a,x,)] } ' ^^^ 

where AHk{y^^^byf^i^, a) is the change in Hk{y'^) when yi = a is replaced by h, and Ad(6, o, Xj) = 
d{b,Xi) — d{a,Xi) = {b — Xif' — (a — Xif is the change in distortion. We refer to the resampling from 
a single location as an iteration, and group the n possible locations into super-iterations jj 

During the simulated annealing, the inverse temperature s is gradually increased, where in super- 
iteration t we use s = 0(log(t)) [23,34]. As the number of iterations t is increased, y" converges 
in distribution to the set of minimal energy solutions, which includes x" (|5|, because large s favors 
low-energy y". Pseudo-code for our encoder appears in Algorithm 1 below. 

^ We recommend an ordering where each super-iteration scans a permutation of all n locations of the input, because in this 
manner each location is scanned fairly often. Other orderings are possible, including a completely random order as prescribed 
by Jalali and Weissman [23]. 



Algorithm 1: Lossy encoder with fixed reproduction alphabet 



Input: x" g M", y, p, c, r 
Output: bit-stream 

Procedure: 

1) Initialize y by quantizing x with y 

2) Initialize mk{-,-) using y 

3) for t = 1 to r do // super-iteration 

4) s ■^ clog(i) for some c > // inverse temperature 

5) Draw permutation of numbers {1, . . . , n} at random 

6) for t' = 1 to n do 

7) Let i be component t' in permutation 

8) Generate new yj using fs{yi = -ly"^*) given in (pi) // Gibbs sampling 

9) Update mfe (•,•)[•] 

10) Apply CTW to y" // compress outcome 



III. Universal algorithm with data-independent reproduction alphabet 

Let us consider how Algorithm 1 can be used to compress analog sources. We will see that choosing 
the reproduction alphabet 3^ to be a finite subset of R (but growing with the input length n in a data- 
independent way) achieves the RD function. 

Let us assume that the variance of source symbols emitted by X is finite, and consider the following 
data-independent reproduction alphabet, 

y^i_±^_t^^,,,X\ 7=[log(n)l, (7) 

It 7 7 J 

where [•] denotes rounding up. In words, 3^ is a quantization of the interval [—7,7] to resolution 1/^/7. 



Other choices of y also allow to demonstrate various RD results; an examination of \12\ indicates that 
slower-growing 7(n) also achieves the RD function. The essential point is that y quantizes a wider 
interval with finer resolution as n is increased, and its size increases sufficiently slowly with n. 

To prove achievability of the RD function asymptotically, we first prove that a global optimization 
(JSJ) that determines x" followed by lossless compression with CTW [36] achieves the RD function. 
Yang et al. [22, 33] proved a similar result for their deterministic algorithm while relying on a different 
reproduction alphabet; our contribution is to prove achievability using the data-independent reproduction 
alphabet y. 

Theorem 1: Consider square error distortion ([T]l, let X be a finite variance stationary and ergodic source 
with RD function R{X^ D), and use the data- independent reproduction alphabet 3^ dTJ) to approximate x" 
by the globally minimum energy solution x" (|5|. Then the length of context tree weighting (CTW) [36] 
applied to x" converges as follows, 

1 



lim sup-E 



\CTW{x^)\ -Pdnix^^x^^ 
n 



<m\xi\R(X,D)- I3D]. (8) 

D>0 



Note that the limsup in ([8]l is actually a limit since the expectation on the left hand side is lower 
bounded by the right hand side for any scheme and any n, cf., e.g., [22,23]. The detailed proof appears 
in Appendix A and we feature some highlights here. In order to prove achievabiUty for the continuous 
alphabet source X, we construct a near-optimal codebook for a given input length n [14], and then 
quantize the components of every codeword in the codebook to y. As n is increased, y quantizes 
a wider interval of values more finely. The wider interval ensures that outlier source symbols have 
a vanishing effect on the distortion, and finer quantization provides near-optimal distortion within the 
interval. Therefore, we have achievabiUty for the continuous amplitude source X via the finite alphabet 

y. 

Now consider running Algorithm 1 instead of the global energy minimization Q using the data- 
independent reproduction alphabet y. The constant c used in Line |4| of Algorithm 1 plays a crucial role. 
If c is large, then the Boltzmann pmf Q favors low-energy sequences too greedily, and the algorithm 
might get stuck in local minima. On the other hand, there exists a universal constant ci such that for 
c < ci we obtain universal performance. To understand why this happens, observe that Algorithm 1 
optimizes over |3^|" possible outputs. As long as c < ci, there is a sufficiently large probability to 
transition between any two outputs, and the algorithm cannot get bogged down in a local mimimum. 
Therefore, in the limit of many iterations Algorithm 1 converges in distribution to the set of minimal 
energy solutions, and we enjoy the same RD performance as in Theorem [T] We refer the reader to Geman 



and Geman [34] for further discussions relating to the choice of ci. The proof appears in Appendix B 

Theorem 2: Consider square error distortion ([T]l, let X be a finite variance stationary and ergodic 
source with RD function R{X, D), and use Algorithm 1 with the data-independent reproduction alphabet 
3^ (M) and sufficiently small c < ci. Let y" be the MCMC approximation to x" after r super-iterations. 
Then the length of context tree weighting (CTW) [36] applied to y" converges as follows, 

1 



lira lini E 

n— >oo r— >oo 



m\n\RiX,D)- pD]. 
D>0 



n 
An important feature of the algorithm is that each iteration of Lines [7]-^ requires computation that is 

proportional to the context depth k and alphabet size |3^|, independent of n [23]. Because the alphabet 

grows slowly in n, the per-iteration computational costs are modest. Each super-iteration contains n 

iterations, and so its computation is 0{nk\y\) = o{n\o^{n)). Decoding is also fast. We first decompress 

CTW [36], and the finite alphabet is then mapped to our data-independent reproduction alphabet y. 

It is also noteworthy that our results could be modified to support other distortion metrics. For example, 

if we used ip distortion, then a technical condition £'[|X|p] < oo ensures that outliers do not increase 



the distortion by much. 

Although promising from a theoretical perspective, Algorithm 1 is of limited practical interest. In 
order to approach the RD function closely, y may need to be large, which slows down the algorithm. 
One approach to improve the algorithm is to encode outlier source symbols, i.e., |xj| > 7, explicitly 
using Rs log(|a;j|/7) bits, perhaps using a universal code for integers [38]. This encoder would reduce 
the distortion caused by outliers, thus allowing to use a narrower interval, yielding a reduction in the 
alphabet size. We leave the study of outlier processing for future work, and focus instead on using an 
adaptive reproduction alphabet to improve the algorithm. 

IV. Adaptive reproduction alphabet algorithm 

Our approach to overcome the disadvantages of large alphabets (Section [Hi]) is inspired by the ground- 
breaking work by Rose on the discrete nature of the reproduction alphabet for iid sources when the 
Shannon lower bound is not tight [35]. In many cases of interest, a small reproduction alphabet achieves 
the RD function of an analog source. Indeed, at sufficiently low rates even a binary reproduction alphabet 
is sometimes optimal [32]. We thus focus on an algorithm that, while supporting the possibility that the 
reproduction alphabet must be large, also supports a possible reduction of the alphabet size, while allowing 
the actual reproduction levels to adapt to the input. 

A. Adaptive reproduction levels 

Following the approach of Yang and Zhang [33], we map the input x" to a sequence 2" over a finite 
alphabet Z, where the actual output y" is derived via a scalar function yi = a{zi). Ideally, the function 
a(-) should minimize expected distortion. Because we focus on square error distortion, the optimal a*(-) 
is the conditional expectation [33], 

a*{a) = E[xi\zi = q], Va G Z. (9) 

The decoder does not know x", and cannot compute a*{-). Therefore, we encode the numerical value 

a* {a) for each a ^ Z. We allocate /ulog(log(n)) bits to encode each 

^ ra*(«)A] 
ag(") = -^ > (10) 

where a*{a) is a quantized version of a* (a), and the quantizer resolution -^ depends on ji and the width 
of the interval being quantized. We observe that it might be advantageous to allocate more bits to encode 
aq{a) for symbols a ^ Z that appear more times in z", but leave such optimizations for future work. 



10 



Nonetheless, if some a ^ Z does not appear in z'\ then there is no need to encode its numerical value. 
We expend one flag bit per character of Z to describe the effective alphabet Z^. = Ze{z^), where Z^.'^Z 
is the subset of the reproduction alphabet Z that appears in z". Because \Z\ = \y\ = 2[log(n)]^ + 1, 
only 0(log^(n)) flag bits are needed. In fact, it suffices for the encoder to describe the cardinality of Z^ 
using 0(log(log(n))) bits, which is insignificant. 

The energy function ([3]) must be modified to support adaptive alphabets as follows, 

£,(z")^n[//fe(z")-/34(x",z")]+/xlog(log(n))|Ze(z")|, (11) 

where ^log(log(n))|Ze| bits are used to encode the reproduction levels that appear in the effective 
alphabet Zf., da{x",z"') is distortion with the adaptive alphabet, 

1 " 

d„(x", z") = d„(x", a*(z")) = ^ J](x, - a;{zi)f, (12) 

1=1 
a*(z") is shorthand for the n-tuple obtained by applying a* to the components of z", and a*{-) is 

computed using (|9]) and ( [TO] ). These definitions require to modify the previous Gibbs sampler ([6]) as 

follows, 

f.izi = a|z"V) 

= (13) 

Zb exp {-s [nAHkiz'-^zf^^,a) - f3Adaib, a, z") + ^ log(log(n))AZe(6, a)] } ' 

where 

Adaib, a, z") ^ n [da{x^, z'-'bz^^,) - daix"", z'-'az^^,)] (14) 



is the change in distortion using the adaptive alphabet (12 1, and AZe{b,a) is the change in the size of 



the effective alphabet when Zj = a is replaced by b. Alternately, the optimization routine can loop over 



different alphabet sizes \Z\ without accounting for \Z\ in the energy (111; this latter approach was used 
in our simulations (Section |V]). 

The crux of the matter is that if a reduced alphabet yields similar distortion results without increasing 
the coding length, then the modified energy function ( [TT] ) induces a smaller effective Z^.. Motivated by 
the theoretical results by Rose [35] and our numerical results (Section [V]), for many analog sources of 
practical interest a small alphabet offers good and in many cases optimum RD performance. In such 
cases, the adaptive alphabet algorithm is advantageous. 

Even if the entire alphabet is used, i.e., Zg = Z, then the location of the reproduction levels is 
optimized via a*(-) in lieu of the uniform quantization used in y (M). Consequently, if we allow the 
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adaptive alphabet algorithm to use Z with the same cardinality of 3^ as in Algorithm 1, then the RD 
performance can only improve. 

We now state formally that the adaptive alphabet algorithm achieves the RD function asymptotically 
without prior knowledge of the source statistics. As before, our result relies on the existence of a universal 
constant C2 such that for c < C2 the transition probabilities between the \Z\^ possible outputs are 
sufficiently large. 

Theorem 3: Consider square error distortion ([T]|, let X be a finite variance stationary and ergodic 
source with RD function R{X,D), use Z with cardinality \Z\ = 2[log(n)]^ + 1 and sufficiently small 
c < C2 in Algorithm 2, and let 0^(2;") be the MCMC approximation to x" after r super-iterations. Then 
the length of context tree weighting (CTW) [36] applied to 2;" converges as follows, 

1 



lim lim E 

n— ^00 r— >-oo 



n 
The formal proof appears in [Appendix C| The key point is that adaptive reproduction levels offer 



m\n\RiX,D)- pD]. 
D>0 



pointwise improvement over the data-independent reproduction alphabet y from Section III per the 
same alphabet size. 

B. Fast computation 

An important contribution by Jalali and Weissman [23] was to show how to compute AHk{y^^^byf_^i, a) 
and Ad{b, a, Xi) rapidly. Without this computational contribution, the encoder would be impractical. The 
adaptive algorithm updates /S.Hk{z^~'^hzfj^T^, a) in an analogous manner. However, whereas Aci(6, o, Xi) = 
{b — Xi)^ — {a — Xi)^ is trivial to compute for the data-independent reproduction alphabet y (Ml, in our 



case A(ia(6, a, z") (14i requires to re-compute da{--, ■), which depends on a*{-). Unfortunately, modifying 
a single location in z" may change the distortion for numerous symbols. 

We now show how to compute Ada{b, a, z") rapidly for the adaptive reproduction alphabet algorithm. 
To do so, we evaluate da{x"-,z"'), 

1 "" 

4(:e",z") = -Vd(:c,,a*(z,)) (15) 

n ^ — ' ^ 



n 

i=l 



JE E (-^ -<(«))' (16) 



n 

a&Z {i: Zi=a} 



2a*Ja) [x,] + «(a))2 [1] \ , (17) 



where (15 1 uses the definitions of d„(-,-) and da{-,-) in (fTl) and (12), respectively, and (16 1 partitions 
Zj, i G {1, . . . , n}, into the different symbols a ^ Z and invokes the definition of square error distortion. 
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Combining ^ and ( [TO] ), 



*(^ \E[xi\z^ = a]A] 






A A 



(18) 



We see that (17i and ([18]) rely extensively on 



^™= E [(^^H' "^e{0,l,2},aGZ, (19) 

{i: Zi=a} 

the TTi'th moments of the portion of x where Zi = a. In each iteration of the algorithm, a single Zi may 
change from a to a'. Consequently, we modify X™ and X™, m G {0,1,2}, by adding and subtracting 
powers of Xi. Given these updated values, the computation of Ada (6, cl, z"-) is rapid, as before. Pseudo- 
code for the adaptive alphabet Algorithm 2 appears below. 

Algorithm 2: Lossy encoder with adaptive reproduction alphabet 



Input: x" G M", Z, f3, c, r 
Output: bit-stream 

Procedure: 

1) Initialize z by quantizing x II can quantize with data-independent y 

2) Initialize mj^{-, ■) and other data structures using z 

3) for t = 1 to r do super-iteration 

4) s ■^ clog(t) for some c > II inverse temperature 

5) Draw permutation of numbers {1, . . . , n} at random 

6) for t' = 1 to n do 

7) Let i be component t' in permutation 

8) for all a in ^ do // evaluate possible changes to Zi 



9) Compute A(i„(6, a, z") via ^, ^JM, and ([M 

10) Compute fs{zi = a\z'^\^) given in (13 )l/modifiedtjibbs distribution 

11) Generate new Zi using fs{zi = -Iz^^*) nOibbs sampling 

12) Update mk{-, •)[•] and X™, m S {0, 1, 2} // previous and new zi 

13) Encode effective alphabet Z^ 

14) Encode a*{a) using /xlog(log(n))|^e| bits 

15) Apply CTW to z" 



As for the data- independent reproduction alphabet case. Algorithm 2 requires 0(nk\Z\) time to 
compute AHk{z^~^bz^^^, a). Utilizing the computational techniques specified above, Ada{b, a, z") can be 
computed in constant time per inner loop of each iteration (LineM), which requires 0(n|^|) = 0(n|3^|) 
computation per super-iteration. We see that computing AHk{z'^^^bz^^i,a) should require more time 
than computing Ada{b,a, z'"'); this was confirmed in our implementation. 

We have also noticed empirically that Algorithm 2 often comes quite close to optimum RD performance 
after a few dozen super-iterations, resulting in reasonable overall computational demands. Additionally, 
in practice the effective alphabet Z^ is often modest. CTW [36] converges to the empirical entropy as 
long as kn = log(n)/log(|Ze|) — ^n(l). and for finite n a smaller alphabet \Z(,\ allows CTW to converge 
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to the empirical entropy for larger context depths A;„. Therefore, Algorithm 2 can optimize over deeper 
context trees, leading to improved compression and faster convergence to the RD function. 

The decoder of the adaptive reproduction alphabet Algorithm 2 resembles the decoder in [23]. First, 
the bit-stream generated by CTW is decompressed to reconstruct z^. The actual real-valued reproduction 
sequence is obtained by mapping from z" to y^ via the adaptive quantizer a* (a), since the mapping a* 
has been described to the decoder. 

V. Numerical results 

To demonstrate the potential of our approach, we implemented the adaptive alphabet Algorithm 2 in 
Matlab. Results for Laplace and autoregressive sources are provided. 

Implementation details: We ran Algorithm 2 for sequences of length n = 1.5 • 10^ using r = 50 
super-iterations, and k ^ ^ log|2|(?^)- We also found two heuristics to be useful. First, for each individual 
compression problem and RD slope /3 the specific temperature evolution s = 0(log(t)) may vary. 
Therefore, for each point we ran four temperature evolution sequences and allowed each one to improve 
over the energy ea{z^) computed with previous evolution sequences. Second, using a good starting point 
helps Algorithm 2 converge. Therefore, we began running low rate problems with small /3, and each 
solution was used as a starting point for the next larger f3. 

Below we plot results averaging over 10 simulations. Each plot compares the MCMC approach to 
entropy coding (ECSQ), results by Yang and Zhang [33], and the RD function. 

Laplace source: We first evaluated an iid Laplace source with pdf f{x) = ^e^'^' such that E[X] = 
and var(X) = 2. For this source, entropy coding performs rather well. However, Yang and Zhang [33] 
give better RD performance (Figure [T]). Algorithm 2 improves further over the deterministic minimization 
by Yang and Zhang [33], which requires availability of a training sequence. Their algorithm can be used 
in a universal setting by partitioning the input into blocks, resulting in a performance loss of 0.2-0.3 
dB [33]. 

Relying on the mapping approach of Rose [35], it can be shown that for low-to- medium rates a small 
odd number of reproduction levels suffices to approach the fundamental RD limit. We have observed that 
the optimal mapping a* (a) is similar to the reproduction alphabet computed by the mapping approach 
of Rose [35]. This similarity suggests that applying Algorithm 1 to the "correct" finite alphabet would 
not improve results by much. 

Autoregressive source: Figure |2] illustrates the RD performance of the different algorithms for an 
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0.8 1 

D [MSE] 



Fig. 2. AR source: Comparison of entropy coding (ECSQ), average rate and distortion of Algorithm 2 (MCMC) 
over 10 simulations, results by Yang and Zhang [33], and the RD function, (n = 1.5 • lO"*, \Z\e {3,9}, r — 50, 

k^ ilog|2|(n).) 



autoregressive (AR) source, where 



Xr, 



pXn-1 + Wn, 



p = 0.9, and the innovation sequence Wn ~ AA(0, 1) is zero mean unit norm iid Gaussian. 

Entropy coding (ECSQ) is not well suited for non-iid sources; vector quantization [17], the deterministic 
minimization algorithm by Yang and Zhang [33], and MCMC can be used instead. Note that ECSQ 
appears in the upper right hand side of the figure; its RD performance is poor. 

We plotted the RD performance of Algorithm 2 using a small reproduction alphabet {\Z\ = 3) and 
a moderately sized one {\Z\ = 9). At low rates, the smaller alphabet offers better RD performance; as 
the rate is increased, larger alphabets quantize the source more precisely. Although Algorithm 2 does 
not compress as well as Yang and Zhang (they describe the AR source as very challenging), recall that 
Algorithm 2 is universal. 
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VI. Discussion 

In this paper, we extended the MCMC simulated annealing approach of Jalali and Weissman [23] 
to analog sources. We described two lossy compression algorithms that asymptotically achieve the RD 
function universally for stationary ergodic continuous amplitude sources. The first algorithm relies on a 
data-independent reproduction alphabet that samples a wider interval with finer resolution as the input 
length is increased. However, the large alphabet slows down the convergence to the RD function, and is an 
impediment in practice. Our second algorithm therefore uses a (potentially smaller) adaptive reproduction 
alphabet. Not only is the adaptive algorithm theoretically motivated for iid sources by the discrete nature 
of the reproduction alphabet when the Shannon lower bound is not tight [35], but our numerical results 
suggest that even for non-iid sources it works well. Additionally, the smaller alphabet accelerates the 
computation. 

Applications: In applications such as image compression [2,3], video compression [4], and speech 
coding [5-7], our algorithms can process a vector of real- valued numbers whose statistics are either 
completely unknown, or perhaps only known approximately. The algorithms will then iterate over the 
data until some reasonable RD performance is attained. 

As an example, consider image coding. The EQ coder [3] processes each sub-band of wavelets 
sequentially, going from low-frequency sub-bands and proceeding toward high-frequency sub-bands. The 
EQ coder classifies wavelet coefficients in each sub-band based on the magnitudes of parent coefficients, 
relying on the insight that the magnitudes of children coefficients are correlated with the magnitudes 
of parents [39]. In a similar manner, our algorithms can utilize the parent coefficients as contextual 
information. 

Appendix A. Proof of Theorem[T] 

Continuous codebook: We begin by constructing a continuous amplitude RD codebook [14]. Given the 
slope (3 of the RD function R{X, D), there exists an optimal rate R{X, /3) and distortion level D{X, (3). 
For any ei > 0, fix the rate R = R{X, (3) + e\. The achievable RD coding theorem [13, 14] demonstrates 
for the source X that in the limit of large n there exist codebooks whose rates are smaller than R and 
whose expected per symbol distortions are less than D(X, /3). We choose such a codebook C comprised 
of at most 2^" codewords, each of length n. The encoder maps x^ to the nearest codeword Cj in C and 
transmits its index j. The decoder then maps index j to Cj. 

Quantized codebook: Now define a quantized codebook C such that cij, the z'th entry of the j'th 
codeword of C, is generated by rounding Cjj, the z'th entry of the j'th codeword of C, to the closest value 
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in 3^. (Recall that —7 and 7 are the smallest and largest values in 3^, respectively.) Using C, the encoder 
and decoder are identical, except that CiJ is used instead of Cij. The quantized codebook C requires the 
same rate as beforejj However, the distortion provided by the quantized codebook C is different. 

Distortion of quantized codebook: To analyze the change in distortion, we consider three cases. In 
the first case, the original codebook value is an outlier whereas the signal value Xi is not, i.e., \cij\ > 7 
and \xi\ < 7. The truncation of \cij\ to 7 reduces the distortion, 

uyXi^ Cij J — yXi ^ij ) ^ v-^i ^ij ) — ^\^i} Cij ) . 

The second case occurs when the original codebook and signal values are both outliers, i.e., |xj|, \cij\ > 7. 
As n — )• 00, the amount of variance beyond the increasing 7 = [log(n)] vanishes, E[{xi-l^\xi\>j{n)})'^] ~ — ^ 
0, because the source X has finite variance and 7 increases (|7]). Therefore, for any 61 > there exists A^i 
such that for all n > Ni the increase in expected distortion d„(a;",y") (ITll due to truncation of outliers 
is smaller than 61. The third case occurs for \cij\ < 7, where rounding changes the square error from 
{xi — Cij)"^ to {xi — Cij)"^, and the distortion changes by 

[Xi Cjj'j \^i Cij) — \Cij) yCij) ~r ^Xi\Cij Cij) 

— (,Cjj Cij)\ZXi Cij Cij) 

— (,Cjj Cij) Y^iyXi Cij) + \Cij Cij)\ . 

Because \cij — Cij\ < ^ ¥Jn, the change in distortion is upper bounded as follows, 

\yXi Cij) yXi Cij) I ^ + - 2- 

We now define sets of indices that relate to the three cases, 

Xi = {i : i G {!,... ,n}, |cij| > 7, l^il < 7}, 

X2 = {i- i £ {l,...,n},\xi\,\cij\> j}, 

X3 = {i: i G {l,...,n},|cij| < 7}. 

^ By quantizing Cj to cj, different Cj could yield identical codewords in C; this would allow to reduce the rate. 
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Summarizing over all i e {1, . . . , n}, 



E [ndnix'' ,Cj)] = E 



E 



< E 



.1=1 



-«j; 



+ 



E 






+ 



/ ^ (^« '^«i J 



ieXs 



.ieX2 



+S 



E(= 



+ 



K^i Ci-i 



7 



+ 



4^2 



< nS[d„(x",Cj)]+n(5i + 



E[\\x^-cjh] 



+ 



n 



(20) 



(21) 



7 47^' 

where Cj and cj are the j'th codewords of C and C, respectively, || • ||i denotes the ii norm, the inequality 



in (20 1 relies on the changes in distortion in the three different cases, and the inequality in ([21]) is 



due to the 7 terms related to X3 that do not appear for Xi and Z2. Because E[dn{x^,Cj)] < D and 
||x" — Cjlli < n^Jdn{x"-, Cj), we have via Jensen's inequality that i?[||x'^ — Cj\\i] < n^/T). Therefore, 



E[dn(x",cj)] <D + 6i + — + 



472 



D + 61 + 



+ 



1 



[log(n)] 4[log(n)]^ 



Because 7 = [log(n)] increases with n, 



E[d{x'',Cj)] <D + 25i 



(22) 



(23) 



Therefore, the quantized codebook C approaches the RD function asymptotically for the continuous 
amplitude source X. 

Lossless compression using CTW: Having demonstrated that there exists a codebook based on the 
finite alphabet C that asymptotically achieves the RD function, we need to prove that the RD performance 
of C can be approached by compressing x" losslessly using CTW [36]. The remainder of the proof borrows 
from the prior art on lossy compression of finite sources [22,23]. Owing to the linearity of expectation. 



E 



1 



n 



\CTW{x'')\- ^d{x'',x'' 



E 



1 



n 



|C7rVF(x")| -Hkix^- 



E 



iJfe(x") -/3d(x'',x^ 



. (24) 



Recall that k = kn = o(log(n)), and so for any €2 > there exists A'^2 such that for all n > N2 CTW 
converges to the empirical entropy [36], 

1 



E 



n 



-\CTW{x'')\ -Hkix'^) 



<e2, 



(25) 
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as long as kn = log(n)/log(|3^|) — i^n(l)- Jalali and Weissman [23] invoke Gray et al. [40] to prove 
that for any 62 > and £3 > there exists a process X that is jointly stationary and ergodic with X 
such that 



E 



iJfe(x'^)-/3d(a;",x^) 



< E 



ilfc(a;")-/3d(a;",x 



n ^n1 



< HiXo\Xzl) + e3-E /3d(x",x^ 



< R{X,D)+e4 + e3-/3{Di[3)+62), 



(26) 
(27) 
(28) 



where (26l relies on the definition of x" (pll, (27i is explained by observing that Hk{x"^) converges to 
H{Xq\XzI) with probability one as /c„ is increased, and (28 1 uses properties of X, i.e., H{Xo\Xzl) < 



R{X, D) + £4 and E I3d{x'^,x^) < D{(3) + 62. Note also that R{X, D) relies impUcitly on /3, and is 



identical to the R{I3) mentioned earlier. We complete the proof by combining (23 1, (24i, (25 1, (28 1, and 
the arbitrariness of 5i, 82, £2, £3, and £4 □ 

Appendix B. Proof of Theorem [2] 

The proof is similar to that by Jalali and Weissman [23, Appendix B], and we only outline the arguments 
here. While the algorithm is running, y" takes one of |3^|" possible values. These values are modeled as 
states of a Markov chain. There is a sufficiently positive probability to transition between any two states, 
because c < ci and each super-iteration of Algorithm 1 processes all n locations of y". If the temperature 
is reduced slowly enough, then the distribution of y" converges toward the stationary distribution of the 
Markov chain. The proof is completed by noting that minimal-energy states occupy all the probabiUstic 
mass of the stationary distribution. Therefore, y" converges in distribution to the set of minimal energy 
solutions, and we enjoy the same RD performance as in Theorem [l] □ 

Appendix C. Proof of Theorem[3] 

The proof is similar to the proofs of Theorems [ij and |2J Consider the sequence z" with globally 
minimal modified energy ( [TT] ), 

z" = arg min ea{w^). (29) 



We first employ arguments from Appendix A to prove that z" achieves the RD function asymptotically. 

Next, we prove that simulated annealing [34] converges to the globally optimal solution asymptotically. 

Achievable for global minimum: Recall that the adaptive algorithm uses Z with cardinality \Z\ = |3^|. 



Consider Appendix A which proves that the globally optimal data-independent reproduction alphabet 
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solution x" achieves the RD function asymptotically. Because \Z\ = \y\, there exists a one to one 
mapping from y to Z, and x" is mapped to some z". The optimal a*(-) may reduce the distortion, 

Although the quantized version a*{-) may increase the distortion, i.e., 

allocating /zlog(log(n)) bits to encode each a* (a) is sufficient to guarantee that the quantization error 



is smaller than - (see the proof of Theorem MJ in lAppendix A I . Our previous derivations (21i, (22i, (23l 
show that for any 6 > the overall distortion da{x^, z^) becomes (5-close to -D(7) as n is increased. We 
conclude from the definitions of energy Q and modified adaptive energy ( [TT] ) that 

ea(^) <e(x^) + n/35 + ^log(log(n))|3^|. 

Because |3^| = 0(log^(n)), the last term due to encoding the quantized a*(-) vanishes relative to n/35. 
Taking 5 as small as we want enables to approach min£)>o[i?(X, D) — fiD] as closely as needed, 

1 



lim supi? 

n— >oo 



n 



-\CTW{z^)\ - Pdaix"" , z^) 



Invoking the converse result of Yang et al. [22,33], 

1 



E 



n 



-\CTW{z^)\-(3da{x'',z^) 



<mm\R(X,D)- pD]. 
D>0 



mm\R(X,D) - BD]. 
D>0 



Simulated annealing: The proof is similar to that in Appendix B The only noteworthy point is that 
for each z" the quantized a*(-) is a deterministic function of z". Therefore, the simulated annealing can 
be posed as a Markov chain over \Z\"' states, where convergence in distribution to the set of minimal 
energy solutions is obtained by recognizing that for c < C2 there is a sufficiently positive probability to 
transition between any two states. D 
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